Bayesian inference was initially developed by Reverend Thomas Bayes, but his ruminations on inverse probability wouldn’t be known until a friend polished and submitted his work to the Royal Society. Bayes’ work was eventually developed and refined by Laplace. Bayesian inference was wildly different from Fisher’s work in defining classical statistical theory (Lesaffre and Lawson 2012).
In opposition to the Bayesian approach is the frequentist approach. The frequentist approach considers the parameter of interest fixed and inference on the parameter is based on the result of repeated sampling. In the Bayesian approach, the parameter of interest is not fixed but stochastic, and inference on the parameter is based on observed data and prior knowledge (Lesaffre and Lawson 2012).
A benefit of the Bayesian approach lies in the ability to include prior knowledge through the selection of a prior. Priors can be subjective or objective. Subjective priors incorporate opinions, of an individual or of a group, which can negatively impact the perceived integrity of the findings. Objective priors are preferred which follow formal rules for determining noninformative priors (Lesaffre and Lawson 2012).
When prior knowledge is lackluster, has little information, or is otherwise not sufficient, a noninformative prior may be used. A common choice for a noninformative prior, in cases with binomial likelihood, is a uniform distribution, also known as a flat prior. In cases with a Gaussian likelihood, the noninformative prior would be a normal prior with \(\sigma^2 \to \infty\) which functions similarly to the flat prior. For cases with a Poisson likelihood, a Gamma(\(\alpha_0\), \(\beta_0\)) prior is used where the sum of the counts are \((\alpha_0 - 1)\) and the experiment size is \(\beta_0\)(Lesaffre and Lawson 2012). For normal linear regression models, conjugate normal-gamma priors can be used to provide a resulting posterior that is of the same family(Yan and Su 2009).
There are a variety of ways to summarize the posterior in order to derive conclusions about the parameter. Its location and variability can be specified by finding the mode, mean, and median of the distribution. Its range can be defined with the equal tail credible interval (not to be confused with the confidence interval in the frequentist approach) or with the high posterior density (HDP) interval. Future predictions for the parameter can be made through posterior predictive distributions (PPD) which factor out \(\theta\) with the assumption that past data will not influence future data, hierarchical independence(Lesaffre and Lawson 2012).
It is not uncommon for the marginal posterior density function of a parameter to be unavailable, requiring an alternate approach to extract insight. It is safe to assert that sampling techniques are used in nearly all Bayesian analyses(Yan and Su 2009). General purpose sampling algorithms available include the inverse ICDF method, the accept-reject algorithm, importance sampling, and the commonly used Monte Carlo integration. The Monte Carlo integration replaces the integral of the posterior with the average obtained from randomly sampled values to provide an expected value. Two popular Markov chain Monte Carlo (MCMC) methods are the Gibbs sampler and the Metropolis-Hastings (MH) algorithm(Lesaffre and Lawson 2012).
2 Methods
2.1 The Frequentist Framework
Linear regression can be achieved using a variety of methods, two of interest are frequentist and Bayesian. The frequentist approach to linear regression is the more familiar approach. It estimates the effects of independent variables(predictors) on dependent variables(the outcome). The regression coefficient is a point estimate, assumed to be a fixed value. Following is the frequentist linear model
The Bayesian approach estimates the relationship between predictors and an outcome in a similar way, however it’s regression coefficient is not a point estimate, but a distribution. That is, the regression coefficient is not assumed to be a fixed value. The Bayesian approach also goes a step further then frequentist regression in it’s inclusion of prior data. The Bayesian approach is so named because it is based on Bayes’ rule which is written as follows:
The normalization constant (\(p(y)\) above) ensures the posterior distribution is a valid distribution, but the posterior density function can be written without this constant. The resulting prediction is not a point estimate, but a distribution (Bayes 1763). The Bayesian approach is derived with Bayes’ theorem wherein the posterior distribution, the updated belief about the parameter given the data \(p(\theta|y)\), is proportional to the likelihood of \(\theta\) given \(y\), \(L(\theta|y)\), and the prior density of \(\theta\), \(p(\theta)\). The former is known as the likelihood function and would comprise the new data for analysis while the latter allows for the incorporation of prior knowledge regarding \(\theta\)(Yan and Su 2009).
\[
f(\theta|D) \propto L(\theta|D)f(\theta)
\]
2.3 The Models
To generate a model for our analysis, we start with the normal data model \(Y_i|\beta_0, \beta_1, \sigma \sim N(\mu, \sigma^2)\) and include the mean specific to our predictor, departure time, \(\mu_i\). The model is
The categorical predictor, Day of the Week, is given a similar treatment to the continuous predictor with the sampling of three normal models with different priors:
The general formula is adjusted for the inclusion of regression coefficients for each day of the week \(\beta_1, \beta_2, ..., \beta_6\), where the intercept coefficient \(\beta_0\) reflects Tuesday’s arrival delay. Only one prior for the predictor must be specified as the stan_glm() function handles categorical variables in a similar way to the glm() function, through dummy coding.
2.4 Prior Selection
Since we are only using two continuous variables for the first model, arrival delay and departure time, the regression parameters will be \(\beta_0\), \(\beta_1\), and \(\sigma\) for intercept, slope, and error As intercept and slope regression parameters can take any real value, we will use normal prior models (Johnson, Ott, and Dogucu 2022).
where \(m_0, s_0, m_1, \text{and } s_1\) are hyperparameters.
The standard deviation parameter must be positive, so we will use an exponential model (Johnson, Ott, and Dogucu 2022).
\[
\sigma \sim \text{Exp}(l)
\]
Due to the fact that the exponential model is a special case of the Gamma model, with \(s = 1\), we can use the definitions of the mean and variance of the gamma model to to find that of the exponential model (Johnson, Ott, and Dogucu 2022).
This analysis will explore the effect of different priors on two models. Traditionally, priors are tuned by selecting values for the aforementioned hyperparameters based on prior knowledge about the data being modeled. These hyperparameters are specific to the distributions of the priors. In the case of this study, the mean and standard deviation are estimated from the data in order to tune the normal priors. In addition to the model with manually tuned priors, two other versions of each of the models will be constructed. One version will involve the use of priors tuned by the function used for modeling, “Default Tuned Priors”, and the other version will utilize flat priors.
Tuned Continuous Model
\(\beta_0\) informs the model intercept
Code
summary(Delays_sample$DEP_TIME_MINS) #mean departure time is 809.3 minutes (~ 1:30pm)Delays_sample_filtered_B0 <-subset(Delays_sample, DEP_TIME_MINS >=800& DEP_TIME_MINS <=820)mean(Delays_sample_filtered_B0$ARR_DELAY) #m_0c = 2sd(Delays_sample_filtered_B0$ARR_DELAY) #s_0c = 36
\(\beta_{0c}\) reflects the typical arrival delay at a typical departure time. With a mean departure time at \(\sim\) 1:30pm, the average arrival delay is \(\sim\) 2 minutes with a standard deviation \(\sim\) 36 minutes.
The slope of the linear model indicates a 0.019 minute increase in arrival delay per minute increase in departure time, so we set \(m_1 = 0.02\). The standard error reflects high confidence at 0.0005, but as to not limit the model we will set it lower at \(s_1 = 0.01\).
\[
\beta_{1} \sim N(0.02, 0.01^2)
\]
\(\sigma\) informs the regression standard deviation
Code
summary(lm_model)$sigma
To tune the exponential model, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error, \(\sim 50\). With this, we can find the rate parameter, \(l\).
For arrival delays by the day of the week, mean arrival delays are between 1 and 7 minutes while the median arrival delays are all in the negative, indicating a skew towards larger delays.
\(\beta_{0}\) reflects the mean arrival delay on Tuesday, our reference. The average arrival delay is \(\sim\) 2 minutes with a standard deviation \(\sim\) 46 minutes.
\[
\beta_{0} \sim N(2, 46^2)
\]
\(\beta_j\) informs the model slopes
For a categorical predictor with the stan_glm() function, the tuned prior, \(\beta_j\), is applied to to the estimation of each coefficient associated with the individual levels of the predictor ($_1, _2, …, _6 $). For this reason, we set the coefficient prior to be weakly informative.
\[
\beta_{j} \sim N(0, 50^2)
\]
\(\sigma\) informs the regression standard deviation
Code
lm_model <-lm(ARR_DELAY ~ DAY_OF_WEEK, data = Delays_sample)summary(lm_model)$sigma
To tune the exponential model, we set the expected value of the standard deviation, \(E(\sigma)\), equal to the residual standard error which is the same as with the previous model, \(\sim 50\).
The Bayesian Normal regression model relies on three assumptions that are not dissimilar to those of frequentist linear regression. Primarily, each observation should be independent of any other observation. The relationship between the outcome and predictor variables is expected to be linear. It is also assumed that the variability of observed data is normally distributed around its’ mean with a consistent standard deviation across all levels of the predictor(homoscedasticity)(Johnson, Ott, and Dogucu 2022).
2.7 Statistical Programming
Data was collected by the Bureau of Transportation Statistics (BTS) and accessed through a dataset compiled by Patrick Zelazko (Zelazko 2023). The data was imported into R (R Core Team 2023) via CSV. This is a large time-series dataset with with 3 million observations, each a specific flight, and 32 features. The data is from flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the time in minutes and attributed reasons for delay. The function stan_glm() was used for simulation of the Normal Bayesian linear regression model from the “rstanarm” library(Brilleman et al. 2018). This function runs the Markov Chain Monte Carlo simulation as well with specified chains, iterations, and the ability to set a seed. These were set to 4 chains, 2000 iterations, and the seed was set to 123. Simulation of the posterior was done with the posterior_predict() function, also from the “rstanarm” library(Brilleman et al. 2018). Evaluation of the model was done by considering the data and it’s source, the assumptions of the model, and the accuracy of the prediction. The posterior predictions were evaluated via k-fold cross validation with the prediction_summary_cv() function from the “bayesrules”library (Dogucu, Johnson, and Ott 2021). This provided median absolute error (MAE) scaled MAE, and the proportion of values that fall within 50% and 95% confidence intervals. Effective sample size (\(N_\text{eff}\)) and R-hat(\(\hat{R}\)) are also calculated using the “bayesplot” package(Gabry et al. 2019).
3 Analysis
3.1 The Dataset
This is a large dataset with with 3 million observations, each a specific flight, and 32 features. The data is from flights within the United States from 2019 through 2023. Diverted and cancelled flights are recorded, as are the time in minuted and attributed reasons for delay. Following are the definitions of the given variables in this dataset.
Header
Description
Fl Date
Flight Date (yyyy-mm-dd)
Airline
Airline Name
Airline DOT
Airline Name and Unique Carrier Code. When the same code has been used by multiple carriers, a numeric suffix is used for earlier users, for example, PA, PA(1), PA(2). Use this field for analysis across a range of years.
Airline Code
Unique Carrier Code
DOT Code
An identification number assigned by US DOT to identify a unique airline (carrier). A unique airline (carrier) is defined as one holding and reporting under the same DOT certificate regardless of its Code, Name, or holding company/corporation.
Fl Number
Flight Number
Origin
Origin Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
Origin City
Origin City Name, State Code
Dest
Destination Airport, Airport ID. An identification number assigned by US DOT to identify a unique airport. Use this field for airport analysis across a range of years because an airport can change its airport code and airport codes can be reused.
Dest City
Destination City Name, State Code
CRS Dep Time
CRS Departure Time (local time: hhmm)
Dep Time
Actual Departure Time (local time: hhmm)
Dep Delay
Difference in minutes between scheduled and actual departure time. Early departures show negative numbers.
Taxi Out
Taxi Out Time, in Minutes
Wheels Off
Wheels Off Time (local time: hhmm)
Wheels On
Wheels On Time (local time: hhmm)
Taxi In
Taxi In Time, in Minutes
CRS Arr Time
CRS Arrival Time (local time: hhmm)
Arr Time
Actual Arrival Time (local time: hhmm)
Arr Delay
Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers.
Cancelled
Cancelled Flight Indicator (1=Yes)
Cancellation Code
Specifies The Reason For Cancellation
Diverted
Diverted Flight Indicator (1=Yes)
CRS Elapsed Time
CRS Elapsed Time of Flight, in Minutes
Actual Elapsed Time
Elapsed Time of Flight, in Minutes
Air Time
Flight Time, in Minutes
Distance
Distance between airports (miles)
Carrier Delay
Carrier Delay, in Minutes
Weather Delay
Weather Delay, in Minutes
NAS Delay
National Air System Delay, in Minutes
Security Delay
Security Delay, in Minutes
Late Aircraft Delay
Late Aircraft Delay, in Minutes
Table 1 shows the flights distinguished by a new variable, Flight Period, where each time period is comprised of an 8-hour segment (i.e. Morning has flights with departure times from 4am to noon followed by afternoon and evening). The Afternoon period is comprised of the most flights (47.4%), followed closely by the Morning period (41.5%), and the Evening period trails the two (11%). The table also gives the means of the departure and arrival times, giving an indication of the density of the flights in the given period. The average departure and arrival delays show much better numbers for the Morning period (5.23, -0.77 minutes) with increasing delays for the Afternoon and Evening periods. The delay counts by type show That the Afternoon and Morning periods account for significantly more of the total delays, though that is without taking into account the smaller contribution of flights by the Evening period on the whole.
Table 1: Flight Delay Summary by Flight Period
Flight Period
Flight Period
Morning
Afternoon
Evening
Total
TotalFlightsCount
1246031 (41.5%)
1423140 (47.4%)
330829 (11.0%)
3000000 (100%)
CancelledFlightsCount
30690 (38.8%)
38343 (48.4%)
10107 (12.8%)
79140 (100%)
DivertedFlightsCount
2555 (36.2%)
3901 (55.3%)
600 (8.5%)
7056 (100%)
AvgCRSDepTime
08:49:31
15:73:19
20:66:23
13:27:04
AvgDepTime
08:53:58
15:89:05
20:12:40
13:29:47
AvgDepDelay
5.23
12.93
16.51
10.12
AvgTaxiOut
16.87
16.44
16.65
16.64
AvgTaxiIn
7.75
7.78
6.95
7.68
AvgCRSArrTime
10:87:15
17:85:11
17:42:14
14:90:34
AvgArrTime
10:86:01
17:71:56
15:89:47
14:66:31
AvgArrDelay
-0.77
7.34
10.04
4.26
AvgAirTime
114.12
109.8
116.31
112.31
CarrierDelayCount
86824 (29.2%)
162266 (54.6%)
47861 (16.1%)
296951 (100%)
SecurityDelayCount
887 (32.1%)
1434 (52.0%)
438 (15.9%)
2759 (100%)
WeatherDelayCount
8380 (26.7%)
18758 (59.7%)
4290 (13.7%)
31428 (100%)
NASDelayCount
80604 (31.4%)
144366 (56.3%)
31507 (12.3%)
256477 (100%)
LateAircraftDelayCount
42721 (16.5%)
168902 (65.2%)
47391 (18.3%)
259014 (100%)
Summary includes morning, afternoon, and evening flight periods.
3.2 Exploratory Data Analysis
The histograms in Figure 1 illustrate the frequencies of air time, arrival delays, and departure delays. The y-axis was transformed to make the visualizations more legible. All show a skew to the right. This makes sense for air times with a higher proportion of regional flights and the exclusion of international departures and arrivals. Shorter delays (for both arrivals and departures) being more frequent than longer delays is also to be expected.
Figure 2 shows the average arrival delay for the largest five airlines (filtered for carriers with over 200,000 flights in the given period). The standard deviations for these airlines are fairly small, indicating a low variability in the arrival delays for these airlines.
Figure 3 displays the average arrival delay for flights at their origin airport using “plotly” (Sievert 2020) and “ggmap” packages(Kahle and Wickham 2013). This comes from the idea that a flight’s arrival delay is collinear with it’s departure delay which may be influenced by it’s origin airport. Airport location information sourced from Random Fractals Inc(Novak 2023).
Figures 4 and 5 show the correlation of the continuous variables and the correlation of the categorical and continuous variables respectively using the “corrplot” package(Wei and Simko 2024). Figure 4 doesn’t provide any particularly valuable insight. Arrival delay has an expected positive correlation coefficient with departure delay. There appears to be a slight positive correlation between taxi times and arrival delay and elapsed time which is also expected.
Figure 5 shows the correlation of the continuous and categorical variables using the p-values from the t-tests and ANOVAs of each relation. Relationships with at least one significant p-value are shown in dark blue. This heatmap is of limited value as there appears to be many significant correlations, but without further inspection it is not known which value(s) of the categorical variables have a significant p-value.
Code
p_value_long <-melt(p_value_results, na.rm =TRUE) # Remove NAsggplot(data = p_value_long, aes(x = Var1, y = Var2, fill = value)) +geom_tile(color ="white") +scale_fill_gradient(low ="blue", high ="red", na.value ="gray90", name ="p-value") +labs(x ="Categorical Variable",y ="Continuous Variable",caption ="Heatmap of p-values for categorical vs continuous variables.", tag ="Figure 5") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1),plot.caption.position ="plot",plot.caption =element_text(hjust =0), plot.tag.position ="topleft")
3.3 Data Preprocessing
To prepare the data for analysis, a few changes were made. Diverted and cancelled flights were removed as they both have arrival delays set to “NA” due to them not arriving as scheduled. Including cancelled and diverted flights would introduce complexity and could lead to bias in the model. The goal of the model is to understand typical arrival delays, so the inclusion of cancelled and diverted flights is unnecessary. The dataset is also quite large with three million observations, so the dataset was sampled to account for computational efficiency. For the first model, the departure time variable was converted from an “HHMM” format to a “minutes past midnight” format for use by the stan_glm() function.
ggplot(Delays_sample, aes(x = DAY_OF_WEEK, y = ARR_DELAY)) +geom_boxplot(outlier.shape =NA) +geom_point(data = mean_delay_by_day, aes(x = DAY_OF_WEEK, y = mean_arr_delay), color ="red4", size =3, shape =8) +labs(caption ="Arrival delay by the day of the week.",x ="Day of the Week",y ="Arrival Delay (minutes)",tag ="Figure 9" ) +theme_minimal()+ylim(-45,45) +theme(plot.caption.position ="plot",plot.caption =element_text(hjust =0), plot.tag.position ="topleft")
3.4 Modeling
3.4.1 The Normal Data Model: Departure Time Predictor
For the continuous predictor, Departure Time, three normal models with different priors are sampled: - Flat Priors - Rstanarm’s Default Priors - Tuned Priors
The posterior median relationship for the first model with flat priors is
\[
-10.92 + 0.02X.
\]
That is, at \(0 \text{ minutes}\) (i.e. Midnight), the expected delay is \(-10.92\). For every minute past midnight, the delay is expected to increase by \(0.02 \text{ minutes}\) (\(1.2 \text{ seconds}\)). At noon(\(X = 720 \text{ minutes}\)), the expected delay for any flight is \(3.48 \text{ minutes}\) (\(\sim 3 \text{ minutes and } 29\text{ seconds}\)).
library(rstanarm)tuned_model_dt <-stan_glm(ARR_DELAY ~ DEP_TIME_MINS, data = Delays_sample,family =gaussian(),prior_intercept =normal(2, 1296),prior =normal(0.02, 0.0001), prior_aux =exponential(0.02),chains =4, iter =2000, seed =123,refresh =0 )library(broom.mixed)tmdt <-tidy(tuned_model_dt, effects =c("fixed", "aux"),conf.int =TRUE, conf.level =0.95)# Store the 4 chains for each parameter in 1 data framemodel_tuned_df <-as.data.frame(tuned_model_dt)##TUNED# Simulate a set of predictionsset.seed(123)shortcut_prediction <-posterior_predict(tuned_model_dt, newdata =data.frame(DEP_TIME_MINS =720))ggarrange(mcmc_dens(shortcut_prediction) +xlab("Delay (minutes)") +ggtitle("Figure 12 (a) Predicted Arrival Delays for a Departure Time of Noon, Tuned Priors"))
Code
model <- tuned_model_dtplotInt <-mcmc_trace(model, pars ="(Intercept)") +labs(title ="Figure 12 (b) MCMC Trace", subtitle ="Intercept")plotBeta <-mcmc_trace(model, pars ="DEP_TIME_MINS") +labs(subtitle ="Departure Time (mins)")plotSig <-mcmc_trace(model, pars ="sigma") +labs(subtitle ="Sigma")ggarrange(plotInt, plotBeta, plotSig, ncol =1, align ="v", heights =c(4, 4, 4))
That is, on Tuesday one can expect a delay of \(1.57\) minutes (\(1\) minute and \(34\) seconds). One can expect the longest delay on Wednesday at \(6\) minutes and \(29\) seconds.
This version of the second model showed little change for the expectations of Tuesday through Thursday, but the rest of the week had shifts of more than a minute for all day except Sunday with a change of \(53\) seconds.
The tuned version of the second model showed coefficients more similar to that of the flat model.
3.5 Modeling Results
A total of 100,000 observations were included in the model creation for this study. As arrival delay was the outcome of interest, cancelled and diverted flights were excluded from this study for reasons to include missing values and extenuating circumstances that would introduce bias. Following, in Table 2, is a summary of the regression coefficients, their standard error, and 95% credible intervals from the Bayesian regression.
Table 2. Estimations of the Posterior Distributions’ Regression Coefficients.
Mean
SE
95% CI
Model 1: Continuous Predictor
Flat Priors
𝛽₀ Intercept
-10.92
0.47
(-11.85; 0.02)
𝛽₁ Departure Time
0.02
0.00
(0.02; 50.86)
𝜎
51.09
0.12
(50.86; -11.86)
Default Tuned Priors
𝛽₀ Intercept
-10.94
0.47
(-11.86; 0.02)
𝛽₁ Departure Time
0.02
0.00
(0.02; 50.86)
𝜎
51.09
0.12
(50.86; -12.02)
Tuned Priors
𝛽₀ Intercept
-11.66
0.17
(-12.02; 0.02)
𝛽₁ Departure Time
0.02
0.00
(0.02; 50.87)
𝜎
51.10
0.12
(50.87; 0.00)
Model 2: Categorical Predictor
Flat Priors
𝛽₀ Intercept (Tuesday)
1.57
0.44
(0.69; 3.72)
𝛽₅ Sunday
4.36
0.61
(3.21; 1.72)
𝛽₆ Monday
2.93
0.65
(1.72; 51.16)
𝛽₁ Wednesday
4.92
0.61
(3.72; 1.47)
𝛽₂ Thursday
2.66
0.65
(1.47; 0.69)
𝛽₃ Friday
1.86
0.62
(0.69; 2.25)
𝛽₄ Saturday
3.43
0.61
(2.25; 3.21)
𝜎
51.39
0.12
(51.16; 0.73)
Default Tuned Priors
𝛽₀ Intercept (Tuesday)
1.54
0.40
(0.73; 3.21)
𝛽₅ Sunday
3.48
0.58
(2.31; 0.74)
𝛽₆ Monday
1.89
0.60
(0.74; 51.17)
𝛽₁ Wednesday
4.40
0.60
(3.21; 1.48)
𝛽₂ Thursday
2.69
0.60
(1.48; 1.75)
𝛽₃ Friday
2.97
0.64
(1.75; 3.77)
𝛽₄ Saturday
4.96
0.59
(3.77; 2.31)
𝜎
51.40
0.12
(51.17; 0.66)
Tuned Priors
𝛽₀ Intercept (Tuesday)
1.54
0.44
(0.66; 3.76)
𝛽₅ Sunday
4.41
0.61
(3.23; 1.73)
𝛽₆ Monday
2.99
0.64
(1.73; 51.17)
𝛽₁ Wednesday
4.96
0.64
(3.76; 1.48)
𝛽₂ Thursday
2.70
0.61
(1.48; 0.71)
𝛽₃ Friday
1.91
0.64
(0.71; 2.28)
𝛽₄ Saturday
3.50
0.63
(2.28; 3.23)
𝜎
51.39
0.12
(51.17; 0.00)
The three versions of Model 1, with departure time as predictor, agree on a \(\beta_1\) coefficient of 0.02 indicating that the delay is expected to increase by \(0.02 \text{ minutes}\) (\(1.2 \text{ seconds}\)) for each minute past midnight. There is some discrepancy with the intercept term \(\beta_0\) with values ranging from -10.92 to -11.66 minutes. Standard deviation is large comparatively at \(\sim\) 51.09 minutes.
There is more variability apparent in the summary of Model 2, though the interceptterm is consistent among the three versions of the model at 1.54 to 1.57 (a matter of 1.8 seconds). Daily estimated delays range from 1.86 to 4.96. Again, standard deviation is large comparatively at \(\sim\) 51.39 minutes.
3.6 K-Fold Cross Validation
Cross validation was done to provide an estimation of model performance. The provided posterior predictive summary is comprised of the median absolute error (MAE) which measures the typical difference between observed and posterior predictive means, the scaled MAE (scaled_MAE) which scales MAE by standard deviations, and the proportion of observations that fall within the \(50\%\) and \(95\%\) posterior prediction intervals. The cross validation assessment (prediction_summary_cv()) was chosen over the posterior prediction summary assessment (prediction_summary()), both from the “bayesrules” package(Dogucu, Johnson, and Ott 2021), as the cross validation procedure provides a more conservative estimate of model performance. Cross validation can also be more computationally efficient with a lower number of folds and the use of a smaller dataset.
Code
library(bayesrules)# #Posterior Predictive Summary# # set.seed(123)# ps_fmdt <- prediction_summary(flat_model_dt, data = Delays_sample)# # set.seed(123)# ps_dmdt <- prediction_summary(default_model_dt, data = Delays_sample)# # set.seed(123)# ps_tmdt <- prediction_summary(tuned_model_dt, data = Delays_sample)# # ps_fmdt <- ps_fmdt %>% # mutate(Model = "fmdt")# ps_dmdt <- ps_dmdt %>% # mutate(Model = "dmdt")# ps_tmdt <- ps_tmdt %>% # mutate(Model = "tmdt")# # ps_results <- bind_rows(ps_fmdt, ps_dmdt, ps_tmdt)# # print(ps_results)#K-fold Cross Validation with k=5, data n=1000set.seed(123)sample_size <-1000Delays_sample2 <- Delays_sample %>%sample_n(sample_size)set.seed(123)cv_fmdt <-prediction_summary_cv(model = flat_model_dt, data = Delays_sample2, k =5)set.seed(123)cv_dmdt <-prediction_summary_cv(model = default_model_dt, data = Delays_sample2, k =5)set.seed(123)cv_tmdt <-prediction_summary_cv(model = tuned_model_dt, data = Delays_sample2, k =5)cv_fmdt <- cv_fmdt$cv %>%mutate(Model ="fmdt")cv_dmdt <- cv_dmdt$cv %>%mutate(Model ="dmdt")cv_tmdt <- cv_tmdt$cv %>%mutate(Model ="tmdt")cv_results <-bind_rows(cv_fmdt, cv_dmdt, cv_tmdt)print(cv_results)
Code
#Posterior Predictive Summary# # set.seed(123)# ps_fmdow <- prediction_summary(flat_model_dow, data = Delays_sample)# # set.seed(123)# ps_dmdow <- prediction_summary(default_model_dow, data = Delays_sample)# # set.seed(123)# ps_tmdow <- prediction_summary(tuned_model_dow, data = Delays_sample)# # ps_fmdow <- ps_fmdow %>% # mutate(Model = "fmdow")# ps_dmdow <- ps_dmdow %>% # mutate(Model = "dmdow")# ps_tmdow <- ps_tmdow %>% # mutate(Model = "tmdow")# # ps_results2 <- bind_rows(ps_fmdow, ps_dmdow, ps_tmdow)# # print(ps_results2)#K-fold Cross Validation with k=5, data n=1000set.seed(123)sample_size <-1000Delays_sample2 <- Delays_sample %>%sample_n(sample_size)set.seed(123)cv_fmdow <-prediction_summary_cv(model = flat_model_dow, data = Delays_sample2, k =5)set.seed(123)cv_dmdow <-prediction_summary_cv(model = default_model_dow, data = Delays_sample2, k =5)set.seed(123)cv_tmdow <-prediction_summary_cv(model = tuned_model_dow, data = Delays_sample2, k =5)cv_fmdow <- cv_fmdow$cv %>%mutate(Model ="fmdow")cv_dmdow <- cv_dmdow$cv %>%mutate(Model ="dmdow")cv_tmdow <- cv_tmdow$cv %>%mutate(Model ="tmdow")cv_results2 <-bind_rows(cv_fmdow, cv_dmdow, cv_tmdow)print(cv_results2)
Table 3. Posterior predictive results from cross validation.
Assumption 2 & 3: scatterplot of raw data, pp_check
# Examine 50 of the 20000 simulated samples
pp_check(bike_model, nreps = 50) +
xlab("rides")
Accuracy of Posterior Predictive Models
discussion on MCMC simulations -> trace, overlay, acf
Neff > (0.1)*N neff_ratio()
5 Conclusion
Summarize your key findings.
Discuss the implications of your results.
References
Bayes, T. 1763. “An Essay Towards Solving a Problem in the Doctrine of Chances. 1763.”
Brilleman, SL, MJ Crowther, M Moreno-Betancur, J Buros Novik, and R Wolfe. 2018. “Joint Longitudinal and Time-to-Event Models via Stan.”https://github.com/stan-dev/stancon_talks/.
Dogucu, Mine, Alicia Johnson, and Miles Ott. 2021. Bayesrules: Datasets and Supplemental Functions from Bayes Rules! Book. https://github.com/bayes-rules/bayesrules.
Gabry, Jonah, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. 2019. “Visualization in Bayesian Workflow.”J. R. Stat. Soc. A 182: 389–402. https://doi.org/10.1111/rssa.12378.
Grolemund, Garrett, and Hadley Wickham. 2011. “Dates and Times Made Easy with lubridate.”Journal of Statistical Software 40 (3): 1–25. https://www.jstatsoft.org/v40/i03/.
Johnson, Alicia A, Miles Q Ott, and Mine Dogucu. 2022. Bayes Rules!: An Introduction to Bayesian Modeling with R. Chapman & Hall.
Lesaffre, Emmanuel, and Andrew B Lawson. 2012. Bayesian Biostatistics. 1st ed. Somerset: John Wiley & Sons, Ltd. https://doi.org/https://doi.org/10.1002/9781119942412.
R Core Team. 2023. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sievert, Carson. 2020. Interactive Web-Based Data Visualization with r, Plotly, and Shiny. Chapman; Hall/CRC. https://plotly-r.com.